Abstract

Background: Motivated by the general need to identify and classify species based on molecular evidence, 
genome comparisons have been proposed that are based on measuring Euclidean distances between Chaos 
Game Representation (CGR) patterns of genomic DNA sequences. 

Results: We provide, on an extensive dataset and using several different distances, confirmation of the 
hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA 
sequences originating from genomes of different species. This finding lends support to the theory that CGRs of 
genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over 
five hundred different 150,000 bp genomic sequences originating from the genomes of six organisms, each 
belonging to one of the kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; 
chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli 
(Bacteria - full genome), and P. furiosus (Archaea - full genome). We also provide preliminary evidence of this 
method’s applicability to closely related species by comparing H. sapiens (chromosome 21) sequences and over 
one hundred and fifty genomic sequences, also 150,000 bp long, from P. troglodytes (Animalia; chromosome 
Y), for a total length of more than 101 million basepairs analyzed. We compute pairwise distances between 
CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps that 
visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display 
their interrelationships. 

Conclusion: Our analysis confirms that CGR patterns of DNA sequences from the same genome are in general 
quantitatively similar, while being different for DNA sequences from genomes of different species. Our analysis 
of the performance of the assessed distances uses three different quality measures and suggests that several 
distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. In 
particular we show that, for this dataset, DSSIM (Structural Dissimilarity Index) and the descriptor distance 
(introduced here) are best able to classify genomic sequences. 

Keywords: comparative genomics; genomic signature; species classification 


Introduction 

Alongside DNA barcoding [1] and Klee diagrams [2], Chaos Game
Representation (CGR) patterns of genomic segments have been proposed as
another method for the classification and identification of genomic
sequences [3-7]. The concept of genomic signature was first introduced
in [8], as being any specific quantitative characteristic of a DNA
genomic sequence that is pervasive along the genome of the same
organism, while being dissimilar for DNA sequences originating from
different organisms. Initial studies [3,9] suggested that short
fragments of genomic sequences retain most of the characteristics of the
species they come from, thus implying that genomic signatures exist.
Moreover, the Chaos Game Representation (CGR) of a DNA sequence, a
graphic representation of its sequence composition, was proposed in [3]
as having both the pervasiveness and differentiability properties
necessary for it to qualify as a genomic signature. This hypothesis was
quantitatively tested and largely confirmed in [4] for 3,176
mitochondrial DNA (mtDNA) sequences, and Molecular Distance Maps were
proposed therein as visualizations of species relationships based on
measuring the distances between the CGR images of their mtDNA genomes.
Note that CGR patterns of mtDNA sequences can be different from those of
DNA sequences from the major genome of the same organism, and that
large-scale quantitative analyses of the hypothesis that CGR can play
the role of a genomic signature for genomic sequences have not, to our
knowledge, been performed. The objective of this study is to confirm
that CGR can play the role of genomic signature for genomic DNA
sequences, as well as to assess various distances that can be used to
compare CGRs of genomic sequences.

We analyze 508 fragments, 150 kbp (kilo base pairs) long, taken from
complete genomic DNA sequences of six species, each representing a
different kingdom: chromosome 21 of Homo sapiens, chromosome 4 of
Saccharomyces cerevisiae, chromosome 1 of Arabidopsis thaliana,
chromosome 14 of Plasmodium falciparum, the genome of Escherichia coli,
and the genome of Pyrococcus furiosus, for a total length of 76,200,000
bp analyzed. We analyze the intergenomic and intragenomic variation of
CGR genomic signatures of these sequences by using six different
distances for image comparison: Structural Dissimilarity Index (DSSIM)
[10], Euclidean distance, Pearson correlation distance [11], Manhattan
distance [12], approximated information distance [13], and a distance we
propose here, called descriptor distance. We visualize the results by
computing the Molecular Distance Maps of all DNA sequences in the
database, for each of the six distances. The resulting Molecular
Distance Maps show a good clustering of the DNA sequences, with those
originating from the same genome being largely grouped together, and
separated from sequences belonging to genomes of different organisms. We
observe that, in some of the cases where the clustering was suboptimal,
the computation of three-dimensional Molecular Distance Maps resolves
what appeared to be cluster overlaps in the two-dimensional Molecular
Distance Maps. Lastly, using the “ground-truth” that sequences from the
same genomes should have similar structural characteristics and thus be
grouped together, while those from genomes of different organisms should
be separated, we assess the six distances by combining three different
quality measures: correlation to an idealized cluster distance,
silhouette accuracy, and histogram overlap. We conclude that DSSIM and
the descriptor distance perform best according to these measures. We
also provide preliminary evidence of this method’s applicability to
classifying genomic DNA sequences of closely related species by
comparing the H. sapiens (chromosome 21) sequences with 168 genomic DNA
sequences, 150 kbp long, from Pan troglodytes (chimp, chromosome Y), for
an additional length of 25,200,000 bp analyzed. Further research may
lead to improvements of these distances for optimal genomic DNA sequence
identification and classification results.


Note that other alignment-free methods have been used for phylogenetic
analysis of DNA sequences. The initial reports on CGRs of genomic
sequences [3,14] contained mostly qualitative assessments of CGR
patterns of whole genes. In [7], several datasets of up to 36 genomic
DNA sequences were analyzed, and in [9] some various-length sequences
were analyzed based on computing Euclidean distances between frequencies
of their k-mers, for k = 1, ..., 8. Subsequently, [5] computed the
Euclidean distance between frequencies of k-mers (k < 5) for the
analysis of 125 GenBank DNA sequences from 20 bird species and the
American alligator. In [15], 27 microbial genomes were analyzed to find
implications of 4-mer frequencies (k = 4) on their evolutionary
relationships. In [13], 20 mammalian complete mtDNA sequences were
analyzed using the “similarity metric”, for k = 7. Another study, [16],
analyzed 459 bacteriophage genomes and compared them with their host
genomes to infer host-phage relationships, by computing Euclidean
distances between frequencies of k-mers for k = 4. In [17], 75 complete
HIV genome sequences were compared using the Euclidean distance between
frequencies of 6-mers (k = 6), in order to group them in subtypes. In
[4] a dataset of 3,176 complete mtDNA sequences was analyzed, and
several Molecular Distance Maps were obtained using DSSIM and a value of
k = 9.

The main contributions of this paper are: 

• We tested and confirmed, for an extensive dataset of a total length
of 101,400,000 bp, the hypothesis that CGR images of genomic DNA
sequences can play the role of a (graphic) genomic signature, meaning
that they have a desirable genome- and species-specificity. The
dataset comprised 150 kbp long sequences taken from genomes of
organisms from each of the six kingdoms of life, augmented by a set of
same-length genomic sequences from P. troglodytes as a test-case of
this method’s applicability to closely related species.

• We assessed the performance of six different distances in this
context, and this analysis included both same-genome and
different-genome DNA fragment pairs. For several of these distances,
the intragenomic values were overall smaller than intergenomic values,
suggesting that this method could separate DNA genomic fragments
belonging to different genomes, based on their CGRs.

• We showed that several distances outperform the Euclidean distance,
which has so far been almost exclusively used for such studies. In
particular, we determined that the DSSIM distance and descriptor
distance (introduced here), both of which essentially compare the
k-mer composition of DNA sequences (herein k = 9), were best able to
differentiate sequences originating from different genomes in this
dataset.

• This study represents, to the best of our knowledge, the largest
combined dataset size and value of k for this type of analysis.

• Based on preliminary data, we suggest the use of three-dimensional
Molecular Distance Maps for improved visualization of the simultaneous
interrelationships among similar or very distant DNA sequences.

Methods 

In this section we first describe the dataset used for our 
analysis, then present an overview of the three main 
steps of the method, and conclude with a description 
of the six distances that we considered. 


Dataset

The dataset we used includes complete genomic sequences from six
organisms, each representing one of the six kingdoms of life, see
Table 1. For additional information about the dataset see Appendix A.

     Organism                               NCBI Acc. Nr.
  1  H. sapiens, chrom. 21 (Animalia)       NC_000021.8
  2  E. coli (Bacteria)                     NC_000913.3
  3  S. cerevisiae, chrom. 4 (Fungi)        NC_001136.10
  4  A. thaliana, chrom. 1 (Plantae)        NC_003070.9
  5  P. falciparum, chrom. 14 (Protista)    NC_004317.2
  6  P. furiosus (Archaea)                  NC_018092.1

Table 1 NCBI accession numbers of the dataset of the complete genomic
DNA sequences considered, in increasing order of their NCBI accession
number.

In order to have a relatively comparable number of DNA sequences for
each organism, we chose the longest chromosomes for all organisms except
H. sapiens, for which the shortest chromosome was chosen.

The DNA sequences in the NCBI database are represented as strings of
letters “A”, “C”, “G”, “T”, and “N” which represent the four nucleobases
Adenine, Cytosine, Guanine, Thymine, and “unidentified Nucleotide”,
respectively. For our analysis we ignored all letters “N”. In
S. cerevisiae and E. coli there were no ignored letters, and in
P. falciparum and P. furiosus the number of ignored letters is of the
order of 0.001% of the length of the sequence. In H. sapiens this number
is 27%, and in A. thaliana it is 0.54%. In H. sapiens, in particular,
96.4% of these ignored letters exist in centromeric and telomeric
regions of the chromosome.

The resulting genomic DNA sequences were divided into successive,
non-overlapping, contiguous fragments, each 150 kbp long. When the last
sequence was shorter than 150 kbp, it was not included in the analysis.
This resulted in 234 fragments for H. sapiens, 30 fragments for E. coli,
10 fragments for S. cerevisiae, 201 fragments for A. thaliana, 21
fragments for P. falciparum, and 12 fragments for P. furiosus, for a
total of 508 DNA fragments, see Table 2.

  Organism        Length (bp)    # Letters “N”    # Fragments
  H. sapiens      48,129,895     13,023,253       234
  E. coli          4,641,652              0        30
  S. cerevisiae    1,531,933              0        10
  A. thaliana     30,427,671        164,359       201
  P. falciparum    3,291,871             37        21
  P. furiosus      1,909,827             10        12

Table 2 Organism considered, total length of genomic sequence, number
of ignored letters “N”, and number of DNA fragments (sequences) obtained
by splitting each complete genomic DNA sequence into consecutive,
non-overlapping, equal length (150 kbp) contiguous fragments.
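The preprocessing described in this section (discarding the letters “N”, then cutting each sequence into consecutive non-overlapping 150 kbp fragments and dropping a shorter final piece) can be sketched in a few lines of Python. This is only an illustration, not the authors' Mathematica code; the function names and the `TABLE_2` dictionary are hypothetical, and the sketch assumes that the letters “N” are removed before the sequence is cut, which is what the fragment counts of Table 2 suggest.

```python
FRAGMENT_LENGTH = 150_000  # 150 kbp, as used in the paper


def split_into_fragments(sequence, fragment_length=FRAGMENT_LENGTH):
    """Drop every letter 'N', then cut into full-length fragments only."""
    cleaned = sequence.upper().replace("N", "")
    n_full = len(cleaned) // fragment_length  # a shorter last piece is dropped
    return [cleaned[i * fragment_length:(i + 1) * fragment_length]
            for i in range(n_full)]


def expected_fragments(total_length, n_letters, fragment_length=FRAGMENT_LENGTH):
    """Number of full fragments implied by a sequence length and its 'N' count."""
    return (total_length - n_letters) // fragment_length


# (length in bp, number of letters 'N') as listed in Table 2.
TABLE_2 = {
    "H. sapiens": (48_129_895, 13_023_253),
    "E. coli": (4_641_652, 0),
    "S. cerevisiae": (1_531_933, 0),
    "A. thaliana": (30_427_671, 164_359),
    "P. falciparum": (3_291_871, 37),
    "P. furiosus": (1_909_827, 10),
}

for organism, (length, n_letters) in TABLE_2.items():
    print(organism, expected_fragments(length, n_letters))
```

Under these assumptions the per-organism counts add up to the 508 fragments reported in Table 2.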


Overview 

The method we used to analyze and classify the 508 sequences of the
dataset has three steps: (i) generate graphical representations (images)
of each DNA sequence using Chaos Game Representation (CGR), (ii) compute
all pairwise distances between these images, and (iii) visualize the
interrelationships implied by these distances as two- or
three-dimensional maps, using Multi-Dimensional Scaling (MDS).

CGR is a method introduced by Jeffrey [3] in 1990 to visualize the
structure of a DNA sequence. A CGR associates an image to each DNA
sequence as follows. Starting from a unit square with corners labelled
A, C, G, and T, and the center of the square as the starting point, the
image is obtained by successively plotting each nucleotide as the middle
point between the current point and the corner labelled by the
nucleotide to be plotted. If the generated square image has a size of
$2^k \times 2^k$ pixels, then every pixel represents a distinct k-mer: a
pixel is black if the k-mer it represents occurs in the DNA sequence,
otherwise it is white. CGR images of genetic DNA sequences originating
from various species show patterns such as squares, parallel lines,
rectangles, triangles, and also complex fractal patterns (Figure 1).
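The midpoint rule just described can be sketched as follows. This is a minimal illustration, not the authors' implementation; the corner placement (A bottom-left, C top-left, G top-right, T bottom-right, with the y axis pointing up) is an assumption chosen to agree with the FCGR layout used later in this section, and `cgr_points` is a hypothetical name.

```python
# Assumed corner coordinates of the unit square (y axis pointing up).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the list of CGR points, one per nucleotide of the sequence."""
    x, y = 0.5, 0.5  # start at the center of the square
    points = []
    for letter in sequence.upper():
        corner_x, corner_y = CORNERS[letter]
        # Move halfway from the current point towards the nucleotide's corner.
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


print(cgr_points("AC"))
```

Each plotted point ends up in the quadrant of the most recently read nucleotide, which is why a pixel of a $2^k \times 2^k$ image corresponds to exactly one k-mer.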

For step (i), a slight modification of the original CGR was used,
introduced by Deschavanne [7]: a k-th order FCGR (frequency CGR) is a
$2^k \times 2^k$ matrix that can be constructed by dividing the CGR plot
into a $2^k \times 2^k$ grid, and defining each element as the number of
points that are situated in the corresponding grid square. A first and
second order FCGR are shown below, where $N_w$ is the number of
occurrences of the oligonucleotide w in the sequence s.

\[
FCGR_1(s) = \begin{pmatrix} N_C & N_G \\ N_A & N_T \end{pmatrix},
\qquad
FCGR_2(s) = \begin{pmatrix}
N_{CC} & N_{GC} & N_{CG} & N_{GG} \\
N_{AC} & N_{TC} & N_{AG} & N_{TG} \\
N_{CA} & N_{GA} & N_{CT} & N_{GT} \\
N_{AA} & N_{TA} & N_{AT} & N_{TT}
\end{pmatrix}
\]

The (k+1)-th order $FCGR_{k+1}(s)$ can be obtained by replacing each
element $N_X$ in $FCGR_k(s)$ with four elements

\[
\begin{pmatrix} N_{CX} & N_{GX} \\ N_{AX} & N_{TX} \end{pmatrix}
\]

where X is a sequence of length k over the alphabet {A,C,G,T}.
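The recursive FCGR layout above implies a direct rule for locating a k-mer in the matrix: the last letter of the k-mer selects the coarsest quadrant and the first letter the finest one. The sketch below counts k-mers into such a matrix; it is an illustration under that reading of the layout, with hypothetical names (`kmer_cell`, `fcgr`), and it assumes that windows containing letters other than A, C, G, T are simply skipped.

```python
# Row 0 is the top row of the matrix; C and G sit in the top half,
# C and A in the left half, matching FCGR_1 = [[N_C, N_G], [N_A, N_T]].
ROW_BIT = {"C": 0, "G": 0, "A": 1, "T": 1}
COL_BIT = {"C": 0, "G": 1, "A": 0, "T": 1}


def kmer_cell(kmer):
    """Return the (row, column) of a k-mer in the 2^k x 2^k FCGR matrix."""
    row = col = 0
    for position, letter in enumerate(kmer):
        # The first letter contributes the finest step, the last the coarsest.
        row += ROW_BIT[letter] << position
        col += COL_BIT[letter] << position
    return row, col


def fcgr(sequence, k):
    """Count all (overlapping) k-mers of the sequence into an FCGR matrix."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if all(letter in ROW_BIT for letter in kmer):
            row, col = kmer_cell(kmer)
            matrix[row][col] += 1
    return matrix


print(fcgr("ACGT", 1))  # one occurrence of each nucleotide
```

With k = 2 this reproduces the positions of $FCGR_2(s)$ shown above, for example $N_{AC}$ in the second row, first column.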





Figure 1 $2^9 \times 2^9$ CGR images of 150 kbp genomic DNA sequences of
(a) H. sapiens, (b) E. coli, (c) S. cerevisiae, (d) A. thaliana,
(e) P. falciparum, and (f) P. furiosus.


For step (ii), after computing the FCGR matrices for each of the 150 kbp
sequences in our dataset, the goal was to measure “distances” between
every two CGR images. There are many distances that can be defined and
used for this purpose [18]. One of the goals of this study was to
identify what distance is better able to differentiate the structural
differences of various genomic DNA sequences and classify them based on
the species they belong to. In this paper we use six different
distances: Structural Dissimilarity Index (DSSIM), descriptor distance
(defined here), Euclidean distance, Manhattan distance, Pearson
correlation distance, and approximated information distance.

For step (iii), after computing all possible pairwise distances we
obtained six different distance matrices. To visualize the
interrelationships between sequences implied by each of the distance
matrices, and to thus visually assess each of the distances, we used
Multi-Dimensional Scaling (MDS). MDS is an information visualization
technique introduced by Kruskal in [19]. Given as input a distance
matrix that contains the pairwise distances among a set of items[1], the
output of MDS is a spatial representation of the items on a common
Euclidean space wherein each item is represented as a point and the
spatial distance between any two points corresponds to the distance
between the items in the distance matrix: objects with a small pairwise
distance will result in points that are close to each other, while
objects with a large pairwise distance will become points that are far
apart. For example, in [4] MDS was used in conjunction with DSSIM and
CGR to produce Molecular Distance Maps that visually display the
simultaneous interrelationships among a set of full mitochondrial DNA
sequences.

The ideal Molecular Distance Map is a placement of
n items as points in an (n − 1)-dimensional space. The
two-dimensional Molecular Distance Map is simply an 
approximation, a flattening of this highly-dimensional 
space onto the plane, which may sometimes result in 
erroneous positioning of some points. Increasing the 
dimensionality of the Molecular Distance Map often 
results in a more accurate representation of the real 
interrelationships between sequences, as embodied in 
the original distance matrix. 
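To make the role of MDS concrete, the following sketch embeds a distance matrix in two or three dimensions. It is not the procedure used for the maps in this paper (those were produced with the Mathematica code in [20]); as a stand-in it uses classical (Torgerson) MDS with a plain power iteration, written with the standard library only, and the name `classical_mds` and its parameters are hypothetical.

```python
import math
import random


def _matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def classical_mds(dist, dims=2, iterations=500, seed=0):
    """Place n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    # (Row means are reused as column means, which assumes symmetry.)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Power iteration: converges to the eigenvector whose eigenvalue
        # has the largest magnitude.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = _matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(x * y for x, y in zip(v, _matvec(b, v)))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so that the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Toy usage: four corners of a 4 x 3 rectangle are recovered up to rotation.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
distances = [[math.dist(p, q) for q in points] for p in points]
print(classical_mds(distances, dims=2))
```

Calling the same function with `dims=3` gives the three-dimensional analogue; for distances that are not exactly Euclidean, the low-dimensional map is only an approximation, which is the flattening effect discussed above.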

Distances 

In this section we describe and formally define each of the six
distances used in our analysis: DSSIM, descriptor distance (introduced
here), Euclidean, Manhattan, Pearson, and approximated information
distance.

[1] In this paper the items are the 150 kbp DNA sequences analyzed.

Structural Similarity Index, SSIM, was introduced in [10] for the
purpose of assessing the degree of similarity between two images. Given
two images X, Y as n × n matrices having as elements integers ranging in
the interval [0, L], SSIM computes three factors (luminance, contrast
and structure) and combines them to obtain a similarity value. However,
instead of computing a global similarity between the two images, each
image is divided into 11 × 11 sliding square windows $X^{ij}$ ($Y^{ij}$
respectively) with $i,j = 1, \dots, n-10$, which move pixel by pixel to
eventually cover the entire image, and the SSIM similarity of any given
pair of images is computed by comparing their corresponding windows. In
addition, an 11 × 11 circular symmetric Gaussian weighting function
$W \in \mathbb{R}^{11 \times 11}$ with a fixed standard deviation of
1.5, normalized to unit sum
($\sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} = 1$), is used. Then, the mean
$\mu_{x,i,j}$ ($\mu_{y,i,j}$ for Y), standard deviation $\sigma_{x,i,j}$
($\sigma_{y,i,j}$ for Y) and correlation $\sigma_{xy,i,j}$ are computed,
as follows:

\[
\mu_{x,i,j} = \sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} X^{ij}_{pq},
\]
\[
\sigma_{x,i,j} = \sqrt{\sum_{p=1}^{11} \sum_{q=1}^{11}
    W_{pq} \left( X^{ij}_{pq} - \mu_{x,i,j} \right)^2},
\]
\[
\sigma_{xy,i,j} = \sum_{p=1}^{11} \sum_{q=1}^{11}
    W_{pq} \left( X^{ij}_{pq} - \mu_{x,i,j} \right)
           \left( Y^{ij}_{pq} - \mu_{y,i,j} \right),
\]

where $A_{pq}$ denotes the (p, q) element of the matrix A. Based on
these values, the luminance $l(X^{ij}, Y^{ij})$, contrast
$c(X^{ij}, Y^{ij})$ and structure $s(X^{ij}, Y^{ij})$ are computed as

\[
l(X^{ij}, Y^{ij}) = \frac{2 \mu_{x,i,j} \mu_{y,i,j} + C_1}
                         {\mu_{x,i,j}^2 + \mu_{y,i,j}^2 + C_1},
\]
\[
c(X^{ij}, Y^{ij}) = \frac{2 \sigma_{x,i,j} \sigma_{y,i,j} + C_2}
                         {\sigma_{x,i,j}^2 + \sigma_{y,i,j}^2 + C_2},
\]
\[
s(X^{ij}, Y^{ij}) = \frac{\sigma_{xy,i,j} + C_3}
                         {\sigma_{x,i,j} \sigma_{y,i,j} + C_3},
\]

where $C_1 = (0.01)^2$, $C_2 = (0.03)^2$, $C_3 = C_2 / 2$. Then, these
three factors are combined to get

\[
SSIM(X^{ij}, Y^{ij}) = l(X^{ij}, Y^{ij}) \, c(X^{ij}, Y^{ij}) \, s(X^{ij}, Y^{ij})
\]

and finally, the SSIM index used to evaluate the overall image
similarity is computed as

\[
SSIM(X, Y) = \frac{1}{(n-10)^2} \sum_{i=1}^{n-10} \sum_{j=1}^{n-10}
             SSIM(X^{ij}, Y^{ij}).
\]

In theory, the values for SSIM range in the interval [−1, 1], with the
similarity being 1 between two identical images, 0, for example, between
a black image and a white image, and −1 if the two images are negatively
correlated; that is, SSIM(X, Y) = −1 if and only if X and Y have the
same luminance $\mu$ and every pixel $x_i$ of image X has the inverted
value of the corresponding pixel $y_i = 2\mu - x_i$ in Y.

To compute the distance rather than the similarity between two images,
we calculate DSSIM(X, Y) = 1 − SSIM(X, Y). Consequently, the range of
DSSIM is the interval [0, 2]: two identical images will result in a
DSSIM distance of 0, while two images that are the negatives of each
other would result in a DSSIM distance of 2.
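The windowed SSIM computation and the resulting DSSIM can be sketched directly from the definitions in this section. The code below is an unoptimized illustration written with the standard library only, not the authors' implementation; the function names are hypothetical, the constants are taken as printed here, and no rescaling of the pixel values is applied, which is an assumption since the text does not spell it out.

```python
import math


def gaussian_window(size=11, sigma=1.5):
    """Circular symmetric Gaussian weights, normalized to unit sum."""
    center = (size - 1) / 2
    w = [[math.exp(-((p - center) ** 2 + (q - center) ** 2) / (2 * sigma ** 2))
          for q in range(size)] for p in range(size)]
    total = sum(sum(row) for row in w)
    return [[value / total for value in row] for row in w]


def dssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2, size=11):
    """DSSIM(X, Y) = 1 - mean SSIM over all sliding windows of two n x n images."""
    n = len(x)
    w = gaussian_window(size)
    c3 = c2 / 2
    positions = n - size + 1  # n - 10 window positions per axis
    total = 0.0
    for i in range(positions):
        for j in range(positions):
            cells = [(w[p][q], x[i + p][j + q], y[i + p][j + q])
                     for p in range(size) for q in range(size)]
            mu_x = sum(wt * a for wt, a, _ in cells)
            mu_y = sum(wt * b for wt, _, b in cells)
            var_x = sum(wt * (a - mu_x) ** 2 for wt, a, _ in cells)
            var_y = sum(wt * (b - mu_y) ** 2 for wt, _, b in cells)
            cov = sum(wt * (a - mu_x) * (b - mu_y) for wt, a, b in cells)
            sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
            luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
            contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
            structure = (cov + c3) / (sd_x * sd_y + c3)
            total += luminance * contrast * structure
    return 1.0 - total / positions ** 2


# Toy usage on two small 12 x 12 "images".
image_a = [[(i * j) % 5 for j in range(12)] for i in range(12)]
image_b = [[(i + 2 * j) % 5 for j in range(12)] for i in range(12)]
print(dssim(image_a, image_a), dssim(image_a, image_b))
```

For $2^9 \times 2^9$ FCGRs this pure-Python loop would be slow; a vectorized or library implementation would be the practical choice.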

The descriptor distance between two FCGRs
$X, Y \in \mathbb{N}^{2^k \times 2^k}$ aims to compare a combination of
several different “descriptors”, that is, a combination of several
different aspects, of the two given FCGRs.


A descriptor is a vector characterized by parameters m and r, as well as
r intervals, where m determines the size, $2^m \times 2^m$, of the
non-overlapping windows into which the FCGR is divided (scale of the
comparison), and the r intervals represent the “granularity” of the
analysis, in that they define the intervals of numbers of k-mer
occurrences that are considered significant.

For a given m < k and r, and intervals
$[a_0, a_1), [a_1, a_2), \dots, [a_{r-1}, a_r)$ such that
$\bigcup_{i=0}^{r-1} [a_i, a_{i+1}) = [0, \infty)$ and
$[a_i, a_{i+1}) \cap [a_j, a_{j+1}) = \emptyset$ for all i, j with
$i \neq j$, a descriptor is constructed as follows.

Starting from the top-left corner, we divide each of the two FCGR
matrices X and Y into non-overlapping submatrices[2] of size
$2^m \times 2^m$. This procedure results in $4^{k-m}$ submatrices
$X_{ij}$ and $Y_{ij}$ with $i,j = 1, \dots, 2^{k-m}$, which will be
pairwise compared.

The choice of the r intervals, called “bins”, points to the fact that,
rather than considering the finest granularity, we are interested in a
coarser comparison. This means that, instead of a computationally
expensive pairwise comparison of all possible numbers of occurrences of
k-mers, we are interested only in certain “bins” of such numbers. For
example, in our case, we use r = 5 and consider only 5 different bins,
that is, k-mers with number of occurrences: 0 (not occurring), 1 (one
occurrence), at least 2 but less than 5, at least 5 but less than 20,
and at least 20 (most frequent). Formally, we use r = 5 and
$[0, \infty) = [0,1) \cup [1,2) \cup [2,5) \cup [5,20) \cup [20, \infty)$
as the 5 bins.

Afterwards, we compute for every $X_{ij}$ a vector
$\mathrm{vec}X_{ij} = \frac{1}{2^m \times 2^m} (b_1, b_2, \dots, b_r)$,
where $b_i = |\{x \in X_{ij} : a_{i-1} \le x < a_i\}|$. In our case, for
each $X_{ij}$, we compute a five-tuple wherein, for example, the 4th
element represents the number of 9-mers whose number of occurrences is
in the 4th bin, that is, at least 5 but less than 20. The division by
$2^m \times 2^m$ is to obtain a probability distribution for each
submatrix. The same procedure is performed for $Y_{ij}$, resulting in
the vector $\mathrm{vec}Y_{ij}$.

We further append all vectors $\mathrm{vec}X_{ij}$ and form a new vector
$\mathrm{vec}X^{m,r}$ and, using the same order of appending, we append
all vectors $\mathrm{vec}Y_{ij}$, forming a new vector
$\mathrm{vec}Y^{m,r}$. These two vectors are the “descriptors” of the
FCGR matrices X and Y for the parameters m, r and the r chosen bins.

As a last step, we combine descriptors $\mathrm{vec}X^{m,r}$
(respectively $\mathrm{vec}Y^{m,r}$) for several values of m and r by
appending them one after another, in the same order, to obtain the
vector $\mathrm{vec}X$ (respectively $\mathrm{vec}Y$).

[2] In general, these windows (submatrices) can be overlapping, but in
this paper we made the choice of using non-overlapping windows.

The descriptor distance between the two FCGRs X and Y is now defined as
the Euclidean distance between the vectors $\mathrm{vec}X$ and
$\mathrm{vec}Y$:

\[
d_D(X, Y) = d_E(\mathrm{vec}X, \mathrm{vec}Y).
\]

In our case we computed descriptors for m = 4, 5, 6, therefore forming
vectors $\mathrm{vec}X$ and $\mathrm{vec}Y$ of length

\[
5 \left( \left(\frac{2^9}{2^4}\right)^2 + \left(\frac{2^9}{2^5}\right)^2
       + \left(\frac{2^9}{2^6}\right)^2 \right) = 6720.
\]

In general, for a given r, the length of the vectors compared is
$r \left( (2^{k-m_1})^2 + (2^{k-m_2})^2 + \dots + (2^{k-m_p})^2 \right)$,
where $m_1, m_2, \dots, m_p$ are the values used for m. The choice of m
for this study was made to balance the computational cost of calculating
the vector of descriptors with the ability to compare the two matrices
at various scales: large (m = 6, that is, compare windows of size
64 × 64), medium (m = 5, windows of size 32 × 32) and small (m = 4,
windows of size 16 × 16). The parameter r = 5 and the 5 bins were kept
constant throughout our calculations but, in general, these parameters
can also be varied, and the resulting vectors for each value added to
the vector of descriptors, resulting in a larger vector.

In principle, the descriptor distance between two FCGRs effectively
compares the distribution of frequencies of k-mers between the
corresponding submatrices $X_{ij}$ and $Y_{ij}$, and does that for
several values of m, that is, at several different scales. (Note that,
in each window $X_{ij}$, all k-mers have the same suffix of length
k − m.)

We now illustrate the descriptor distance by an example wherein k = 3,
m = 2, r = 3, and the 3 bins are
$[0,15) \cup [15,30) \cup [30, \infty)$. Since k = 3, the FCGR table
will contain the number of occurrences of all 3-mers in a DNA sequence,
as follows:


  CCC  GCC  CGC  GGC  CCG  GCG  CGG  GGG
  ACC  TCC  AGC  TGC  ACG  TCG  AGG  TGG
  CAC  GAC  CTC  GTC  CAG  GAG  CTG  GTG
  AAC  TAC  ATC  TTC  AAG  TAG  ATG  TTG
  CCA  GCA  CGA  GGA  CCT  GCT  CGT  GGT
  ACA  TCA  AGA  TGA  ACT  TCT  AGT  TGT
  CAA  GAA  CTA  GTA  CAT  GAT  CTT  GTT
  AAA  TAA  ATA  TTA  AAT  TAT  ATT  TTT


Take the two FCGRs $X, Y \in \mathbb{N}^{8 \times 8}$ (k = 3, thus
$2^3 \times 2^3$) corresponding to two genomic 150 kbp sequences of our
dataset (one human and one bacterial), respectively. In order to use
small numbers throughout the example, we divide all elements of the
obtained matrices by 100 and take the integer part of each element,
obtaining:

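\[
X = \begin{pmatrix}
42 & 33 &  9 & 33 & 14 & 10 & 15 & 45 \\
22 & 30 & 26 & 25 &  9 &  5 & 37 & 37 \\
32 & 21 & 33 & 19 & 44 & 35 & 41 & 35 \\
17 &  9 & 13 & 21 & 23 & 10 & 22 & 18 \\
37 & 26 &  6 & 32 & 34 & 24 &  9 & 23 \\
29 & 24 & 31 & 27 & 19 & 27 & 18 & 28 \\
21 & 23 & 10 &  9 & 19 & 17 & 21 & 15 \\
35 & 15 & 14 & 14 & 19 & 12 & 17 & 30
\end{pmatrix}
\]

\[
Y = \begin{pmatrix}
18 & 34 & 40 & 27 & 30 & 36 & 27 & 12 \\
27 & 18 & 27 & 32 & 24 & 23 & 15 & 23 \\
24 & 17 & 13 & 17 & 36 & 12 & 32 & 18 \\
27 & 17 & 28 & 26 & 18 &  8 & 22 & 25 \\
32 & 32 & 23 & 16 & 16 & 25 & 23 & 22 \\
20 & 29 & 18 & 25 & 16 & 16 & 15 & 17 \\
25 & 25 &  7 & 16 & 26 & 27 & 20 & 25 \\
32 & 21 & 20 & 21 & 25 & 18 & 27 & 34
\end{pmatrix}
\]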

Thus, in the human DNA sequence, the triplet CCC 
appears about 4200 times, the triplet GCC appears 
about 3300 times, the triplet CGC appears about 900 
times, etc. 

Since m = 2, we divide each of the matrices X and Y into non-overlapping
submatrices of size 4 × 4 ($2^2 \times 2^2$). For X we thus obtain
$X_{11}$, $X_{12}$, $X_{21}$, $X_{22}$:

\[
X_{11} = \begin{pmatrix}
42 & 33 &  9 & 33 \\
22 & 30 & 26 & 25 \\
32 & 21 & 33 & 19 \\
17 &  9 & 13 & 21
\end{pmatrix},
\qquad
X_{12} = \begin{pmatrix}
14 & 10 & 15 & 45 \\
 9 &  5 & 37 & 37 \\
44 & 35 & 41 & 35 \\
23 & 10 & 22 & 18
\end{pmatrix},
\]

\[
X_{21} = \begin{pmatrix}
37 & 26 &  6 & 32 \\
29 & 24 & 31 & 27 \\
21 & 23 & 10 &  9 \\
35 & 15 & 14 & 14
\end{pmatrix},
\qquad
X_{22} = \begin{pmatrix}
34 & 24 &  9 & 23 \\
19 & 27 & 18 & 28 \\
19 & 17 & 21 & 15 \\
19 & 12 & 17 & 30
\end{pmatrix},
\]

and similarly for Y.

Since the r = 3 bins are $[0,15) \cup [15,30) \cup [30, \infty)$, we
will count, for each submatrix, the number of 3-mers for which the
number of occurrences is less than 15, at least 15 but less than 30, and
greater than or equal to 30. Thus we obtain
$\mathrm{vec}X_{11} = \frac{1}{16}(3,7,6)$, which has as elements the
number of elements of $X_{11}$ which belong in each of the intervals
selected, divided by the total number of elements of $X_{11}$. We
proceed similarly for $\mathrm{vec}X_{12} = \frac{1}{16}(5,4,7)$,
$\mathrm{vec}X_{21} = \frac{1}{16}(5,7,4)$,
$\mathrm{vec}X_{22} = \frac{1}{16}(2,12,2)$ and we form $\mathrm{vec}X$
by appending these vectors one after the other, that is

\[
\mathrm{vec}X = \frac{1}{16} (3,7,6,5,4,7,5,7,4,2,12,2).
\]
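The computation of vecX can be checked with a short script. The sketch below is an illustration rather than the authors' code: `descriptor` is a hypothetical helper that bins every $2^m \times 2^m$ window of a matrix, and the two matrices are the 8 × 8 example matrices X and Y of this section. It also applies the same procedure to Y and takes the Euclidean distance between the two descriptors.

```python
import math

X = [[42, 33, 9, 33, 14, 10, 15, 45],
     [22, 30, 26, 25, 9, 5, 37, 37],
     [32, 21, 33, 19, 44, 35, 41, 35],
     [17, 9, 13, 21, 23, 10, 22, 18],
     [37, 26, 6, 32, 34, 24, 9, 23],
     [29, 24, 31, 27, 19, 27, 18, 28],
     [21, 23, 10, 9, 19, 17, 21, 15],
     [35, 15, 14, 14, 19, 12, 17, 30]]

Y = [[18, 34, 40, 27, 30, 36, 27, 12],
     [27, 18, 27, 32, 24, 23, 15, 23],
     [24, 17, 13, 17, 36, 12, 32, 18],
     [27, 17, 28, 26, 18, 8, 22, 25],
     [32, 32, 23, 16, 16, 25, 23, 22],
     [20, 29, 18, 25, 16, 16, 15, 17],
     [25, 25, 7, 16, 26, 27, 20, 25],
     [32, 21, 20, 21, 25, 18, 27, 34]]

BIN_EDGES = [0, 15, 30, math.inf]  # the bins [0,15), [15,30), [30, inf)


def descriptor(matrix, m, edges):
    """Append, window by window, the fraction of entries falling in each bin."""
    size = 2 ** m
    blocks = len(matrix) // size
    vector = []
    for block_row in range(blocks):
        for block_col in range(blocks):
            window = [matrix[block_row * size + p][block_col * size + q]
                      for p in range(size) for q in range(size)]
            for low, high in zip(edges, edges[1:]):
                count = sum(low <= value < high for value in window)
                vector.append(count / len(window))
    return vector


vec_x = descriptor(X, 2, BIN_EDGES)
vec_y = descriptor(Y, 2, BIN_EDGES)
print([round(16 * v) for v in vec_x])    # [3, 7, 6, 5, 4, 7, 5, 7, 4, 2, 12, 2]
print(round(math.dist(vec_x, vec_y), 3))  # 0.718
```

For the full method, the descriptors obtained for several values of m would be appended before taking the Euclidean distance; the example keeps a single scale, as in the text.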
We apply exactly the same procedure for the matrix Y and we get

\[
\mathrm{vec}Y = \frac{1}{16} (1,12,3,3,9,4,1,12,3,0,15,1).
\]

The descriptor distance between these two FCGRs is computed as the
Euclidean distance between $\mathrm{vec}X$ and $\mathrm{vec}Y$, in this
case $d_D(X,Y) \approx 0.718$. Note that, since we started by dividing
the number of 3-mer occurrences by 100, as well as because of the bin
selection, this is a fictitious example. The real value of the
descriptor distance between the mentioned human and bacterial sequences
is 8.66, and the range of the descriptor distance for this dataset of
DNA sequences is [0, 13.17]. In general, the descriptor distance has a
variable range that depends on the choices of parameters used.

To compute the Euclidean, Manhattan and Pearson distances, we first
convert the matrices $X, Y \in \mathbb{N}^{n \times n}$ into
$1 \times n^2$ vectors. For two vectors $x, y \in \mathbb{R}^n$, their
Euclidean distance $d_E(x,y)$ and their Manhattan distance $d_M(x,y)$
are computed as

\[
d_E(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
\qquad
d_M(x,y) = \sum_{i=1}^{n} |x_i - y_i|,
\]

while their Pearson distance $d_P(x,y)$ is defined as

\[
d_P(x,y) = 1 - \frac{\sigma_{xy}}{\sigma_x \sigma_y}
\]

where

\[
\mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i,
\qquad
\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^2},
\qquad
\sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y).
\]

In theory, the correlation coefficient
$\sigma_{xy} / (\sigma_x \sigma_y)$ ranges in the interval [−1, 1], and
therefore the Pearson distance ranges in the interval [0, 2].

The last distance we considered is based on the information distance
defined in [13]. The use of this distance is motivated computationally
since it is easily computed from FCGRs as it tracks the number of
different k-mers for a sequence instead of the actual set. In [13], for
a given k, the information distance for two strings x, y is defined as

\[
d_{AID}(x,y) = \frac{N_k(x|y) + N_k(y|x)}{N_k(xy)}
\]

with

\[
N_k(x|y) = N_k(xy) - N_k(x)
\]

where $N_k(x)$ is the number of different k-mers (possibly overlapping)
which occur in x. We go one step further and modify this in order to
avoid the creation of “unwanted” k-mers from the concatenation xy of x
and y. First, we need to show how we compute $N_k(x)$ for a sequence x.
For a sequence x, firstly, we build its
$FCGR(x) = X \in \mathbb{N}^{2^k \times 2^k}$, which is a matrix of
$2^k \times 2^k$ with element values in $\mathbb{N}$. Then we unitize X,
that is, every non-zero entry becomes 1, while zeros remain 0. $N_k(x)$
is now computed as the sum of the elements of this unitized FCGR, that
is, $N_k(x) = f(X) = \mathrm{SumOfElements}(\mathrm{Unitize}(X))$. For
two strings x and y, with FCGRs X and Y respectively, we define
$N_k(x|y)$ as:

\[
N_k(x|y) = f(X + Y) - N_k(x) \tag{1}
\]

This slight modification of the information distance gives us also the
desired properties of d(x, x) = 0 and d(x, y) = d(y, x) which were not
satisfied before. Using (1), we now define the approximated information
distance (AID) as:

\[
d_{AID}(x,y) = 2 - \frac{f(X) + f(Y)}{f(X + Y)} \tag{2}
\]

where x, y are the strings and
$X, Y \in \mathbb{N}^{2^k \times 2^k}$ their FCGRs, respectively. It
also turns out that this distance is in fact the normalized Hamming
distance of the unitized FCGRs X and Y. Note that, for two sets X and Y,
the normalized Hamming distance is
$\frac{|X \,\Delta\, Y|}{|X \cup Y|}$, where $\Delta$ denotes the
symmetric difference.

The generation of CGR images, calculation of distance matrices and
creation of 2D and 3D Molecular Distance Maps with MDS were done and can
be tested with the code available in [20], written in Wolfram
Mathematica, version 9. The interactive webtool MoDMap [21] allows
in-depth exploration of the 2D MoD Maps (Molecular Distance Maps) in
this paper[3].

[3] When using the interactive webtool MoDMap, clicking on a distance
underneath a dataset will result in

Online Supplemental Material [20] includes all distance matrices and the
code used to produce all figures and plots in this paper. More details
about the online resources can be found in Appendix B.
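The four vector-based distances defined in this section (Euclidean, Manhattan, Pearson, and approximated information distance) are simple enough to sketch with the standard library. The code below is an illustration, not the Mathematica code of [20]; the function names are hypothetical, the first three functions expect the flattened FCGRs, and `aid` follows equation (2), with f counting the non-zero entries of a matrix.

```python
import math


def flatten(matrix):
    """Convert an n x n matrix into a vector of length n^2."""
    return [value for row in matrix for value in row]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson_distance(x, y):
    """1 minus the correlation coefficient; undefined for constant vectors."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / n)
    return 1 - cov / (sd_x * sd_y)


def aid(fcgr_x, fcgr_y):
    """Approximated information distance, 2 - (f(X) + f(Y)) / f(X + Y)."""
    x, y = flatten(fcgr_x), flatten(fcgr_y)
    f_x = sum(1 for a in x if a > 0)
    f_y = sum(1 for b in y if b > 0)
    f_xy = sum(1 for a, b in zip(x, y) if a + b > 0)
    return 2 - (f_x + f_y) / f_xy


# Toy usage on two 2 x 2 "FCGRs".
A = [[1, 0], [0, 2]]
B = [[0, 3], [0, 1]]
print(euclidean(flatten(A), flatten(B)), manhattan(flatten(A), flatten(B)))
print(pearson_distance(flatten(A), flatten(B)), aid(A, B))
```

In the toy example the unitized matrices differ in two of the three cells that are occupied in either matrix, so `aid(A, B)` equals 2/3, in agreement with the normalized Hamming interpretation.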

Analysis and Results 

For our dataset, we use k = 9, that is, each DNA sequence was
represented as a $2^9 \times 2^9$ FCGR matrix. In practice, this means
that the FCGR of a DNA sequence contains the full information regarding
its k-mer sequence composition, for k = 1, 2, ..., 9. The choice of a
length of 150 kbp and of the value k = 9 is justified by the fact that,
for a random sequence of length 150 kbp, its CGR at resolution
$2^9 \times 2^9$ has around half of the pixels black, and half white.

Figure 2 depicts two-dimensional Molecular Distance 
Maps for the over five hundred DNA sequences in 
our dataset, computed using the DSSIM distance, de¬ 
scriptor distance, Euclidean distance, Manhattan dis¬ 
tance, Pearson distance and approximated informa¬ 
tion distance, respectively. Figure 3 depicts the corre¬ 
sponding three-dimensional Molecular Distance Maps 
for the same dataset. The projection of each three- 
dimensional map is chosen by hand in order to visually 
separate clusters of points which appear to be overlap¬ 
ping in the two-dimensional maps, as discussed below. 

We note that MDS is not a clustering method, as the 
clusters are defined beforehand by the coloring scheme 
used (blue for H. sapiens , green for E. coli , and so on). 
MDS simply tries to display visually the interrelation¬ 
ships between the given items, based on the pairwise 
distances in the distance matrix which is its input. 
Note also that an increase in dimensionality from 2 to 
3 can lead to a better cluster visualization. For exam¬ 
ple, if we compare the two-dimensional and the three- 
dimensional Molecular Distance Maps obtained using 
DSSIM, we see that points that appeared to be erro¬ 
neously mixed with each other in the two-dimensional 
map, Figure 2(a), (S. cerevisiae and P. falciparum se¬ 
quences mixed in with A. thaliana sequences) were in 
fact clearly separated from each other in Figure 3(a), 
the three-dimensional version of the Molecular Dis¬ 
tance Map. 

plotting the MoD Map of the dataset computed with 
that distance. On any particular MoD Map, clicking on 
a point will display a window with information about 
the subsequence represented by that point: its NCBI 
accession number, scientific name of the organism it 
originates from, and its CGR pattern. Clicking on the 
“From here” and “To here” buttons on two such se¬ 
lected windows will display the distance between the 
corresponding genomic subsequences in the distance 
matrix. 
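To make concrete how MDS turns a distance matrix into two- or
three-dimensional point coordinates, here is a minimal sketch of
classical (Torgerson) MDS. The text states only that MDS was used, so
the choice of the classical variant, the function name `classical_mds`
and its parameters are assumptions of this sketch, not a description of
the Mathematica code in [20].

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed n items in `dims` dimensions from an n x n distance matrix D.

    Classical MDS: double-centre the squared distances and keep the
    eigenvectors belonging to the largest eigenvalues.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # Gram matrix of the embedding
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dims]   # indices of the largest ones
    lam = np.clip(eigvals[order], 0.0, None)   # drop negative (non-Euclidean) parts
    return eigvecs[:, order] * np.sqrt(lam)    # one row of coordinates per item
```

Calling `classical_mds(D, dims=3)` on the same matrix gives the
three-dimensional counterpart; the first two coordinates coincide with
the two-dimensional map, and the third can separate clusters that
overlap in two dimensions.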


Figure 4 displays the histograms of the pairwise intragenomic
distances (dark blue and turquoise) and intergenomic distances
(grey) of DNA sequences from H. sapiens and A. thaliana, obtained
using each of the six distances. As noted, some distances seem to
perform better than others. Visually, the poorest performer for
these two sets of sequences (from H. sapiens and A. thaliana) seems
to be the Euclidean distance, for which the intragenomic distances
are as high as the intergenomic distances and no separation is
visible. In contrast, DSSIM gives, for the same data, intergenomic
distances that are overall much higher than the intragenomic
distances, resulting in a clear classification of DNA sequences
into the species they belong to.

Table 3 displays the mean and standard deviation of distances
between clusters $C_i$ and $C_j$, $1 \le i, j \le 6$, where a
cluster $C_i$ is defined as the set of all genomic sequences from
the genome of organism $i$, as labelled in Table 1. In each
subtable, the diagonal entries give the mean and standard deviation
of the intragenomic distances, while all other entries are
intergenomic distances. From this table we see that for DSSIM,
Manhattan and approximated information distance, the maximum of all
the averages of intragenomic distances in this dataset is strictly
smaller than the minimum of all the averages of intergenomic
distances. For the descriptor distance and the Pearson distance the
previous statement does not hold but, for each pair of organisms,
the two averages of intragenomic distances (e.g., human-human and
plant-plant) are both lower than the average of the intergenomic
distances (human-plant). For the Euclidean distance, none of the
previous statements holds. For example, the average of the
plant-plant intragenomic distances (element 4-4 in the Euclidean
distance subtable of Table 3) is 723, which is larger than 672, the
average of the yeast-plant intergenomic distances (element 3-4 in
the Euclidean distance subtable of Table 3). The complete
histograms of all pairwise comparisons $C_i$-$C_j$ can be found in
Appendix C.
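For readers who wish to reproduce a table of this kind from one of the
distance matrices, the following sketch computes the mean and standard
deviation for every pair of clusters. The function name, the use of
integer cluster labels, and the population standard deviation (NumPy's
default) are assumptions; the text does not say which estimator was
used. Each cluster is assumed to contain at least two sequences.

```python
import numpy as np

def cluster_distance_stats(D, labels):
    """Mean and standard deviation of distances for every cluster pair.

    D is a symmetric n x n distance matrix and labels[i] is the cluster
    of item i. Returns {(ci, cj): (mean, std)} for ci listed before or
    equal to cj, i.e. the upper triangle of a table such as Table 3.
    """
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = list(dict.fromkeys(labels.tolist()))  # order of first appearance
    stats = {}
    for a, ci in enumerate(clusters):
        rows = np.where(labels == ci)[0]
        for cj in clusters[a:]:
            cols = np.where(labels == cj)[0]
            block = D[np.ix_(rows, cols)]
            if ci == cj:
                # intragenomic: each unordered pair once, no self-distances
                vals = block[np.triu_indices(len(rows), k=1)]
            else:
                # intergenomic: every cross pair
                vals = block.ravel()
            stats[(ci, cj)] = (vals.mean(), vals.std())
    return stats
```

The diagonal keys `(i, i)` hold the intragenomic statistics and the
keys `(i, j)` with `i` before `j` hold the intergenomic ones.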



Karamichalis et al. 


Page 9 of 14 


DSSIM
        1             2             3             4             5             6
  1     0.81 ± 0.04   0.99 ± 0.01   0.92 ± 0.02   0.91 ± 0.03   0.92 ± 0.03   0.91 ± 0.02
  2     -             0.85 ± 0.01   0.97 ± 0.01   0.99 ± 0.01   0.99 ± 0.01   0.99 ± 0.…
  3     -             -             0.87 ± 0.01   0.89 ± 0.02   0.91 ± 0.…    0.91 ± 0.01
  4     -             -             -             0.87 ± 0.03   0.9 ± 0.02    0.91 ± 0.01
  5     -             -             -             -             0.74 ± 0.01   0.94 ± 0.…
  6     -             -             -             -             -             0.83 ± 0.01

Descriptors
        1             2             3             4             5             6
  1     3.76 ± 1.69   9.74 ± 0.66   5.92 ± 1.14   5.71 ± 1.41   9.33 ± 1.23   5.44 ± 0.92
  2     -             2.5 ± 0.28    8.05 ± 0.39   9.1 ± 0.55    12.67 ± 0.19  9.38 ± 0.41
  3     -             -             2.12 ± 0.08   3.42 ± 1.05   9.48 ± 0.31   4.6 ± 0.09
  4     -             -             -             2.75 ± 1.33   8.23 ± 0.94   4.94 ± 0.76
  5     -             -             -             -             1.53 ± 0.14   9.99 ± 0.28
  6     -             -             -             -             -             2.4 ± 0.32

Euclidean
        1             2             3             4             5             6
  1     756 ± 498     856 ± 349     756 ± 361     818 ± 514     3914 ± 510    812 ± 356
  2     -             558 ± 5       674 ± 17      802 ± 366     4102 ± 466    696 ± 18
  3     -             -             564 ± 11      672 ± 383     3964 ± 472    633 ± 20
  4     -             -             -             723 ± 535     3923 ± 506    748 ± 372
  5     -             -             -             -             999 ± 276     4085 ± 468
  6     -             -             -             -             -             585 ± 24

Manhattan (in thousands)
        1             2             3             4             5             6
  1     171 ± 15      222 ± 5       189 ± 13      188 ± 17      213 ± 20      191 ± 9
  2     -             175 ± 2       209 ± 4       219 ± 8       252 ± 4       218 ± 3
  3     -             -             171 ± 2       177 ± 10      206 ± 2       184 ± 2
  4     -             -             -             172 ± 16      200 ± 11      188 ± 9
  5     -             -             -             -             105 ± 3       224 ± 2
  6     -             -             -             -             -             167 ± 3

Pearson
        1             2             3             4             5             6
  1     0.5 ± 0.12    0.97 ± 0.02   0.69 ± 0.1    0.64 ± 0.12   0.65 ± 0.09   0.81 ± 0.06
  2     -             0.71 ± 0.02   0.93 ± 0.02   0.96 ± 0.02   0.98 ± 0.01   0.99 ± 0.02
  3     -             -             0.6 ± 0.02    0.6 ± 0.07    0.71 ± 0.03   0.75 ± 0.02
  4     -             -             -             0.53 ± 0.11   0.63 ± 0.09   0.76 ± 0.04
  5     -             -             -             -             0.02 ± 0.01   0.94 ± 0.01
  6     -             -             -             -             -             0.64 ± 0.03

Approx. Information
        1             2             3             4             5             6
  1     0.65 ± 0.03   0.78 ± 0.01   0.7 ± 0.03    0.7 ± 0.03    0.76 ± 0.04   0.69 ± 0.02
  2     -             0.67 ± 0.…    0.75 ± 0.01   0.77 ± 0.02   0.85 ± 0.01   0.77 ± 0.01
  3     -             -             0.67 ± 0.01   0.68 ± 0.02   0.74 ± 0.…    0.69 ± 0.…
  4     -             -             -             0.67 ± 0.03   0.73 ± 0.02   0.69 ± 0.02
  5     -             -             -             -             0.64 ± 0.01   0.76 ± 0.01
  6     -             -             -             -             -             0.65 ± 0.01

Table 3 Mean and standard deviation of distances between
clusters $C_i$-$C_j$ for $i, j = 1, \ldots, 6$.

Figure 2 Two-dimensional Molecular Distance Maps of genomic DNA
sequences from all six organisms in the dataset, obtained using
(a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan,
(e) Pearson and (f) approximated information distance,
respectively. Each point corresponds to a 150 kbp genomic sequence
from H. sapiens (blue), E. coli (green), S. cerevisiae (red),
A. thaliana (turquoise), P. falciparum (magenta), and
P. furiosus (orange).


Quality Measures for Distances 

In this section we present three quality measures, each of which
evaluates the quality of the six distances considered. In the data
mining literature a wide range of quality measures for clusterings
has been defined; see for example [22, 23]. Most of these methods
are designed to assess the quality of different automated
clustering methods while using the same distance. Our set-up is
different, as we use different distances while the clustering is
fixed and given by the initial colour-coding of the
sequence-representing points. Thus, we have to use other approaches
to compare the distances we analyze. In particular, as the six
distances have different ranges, we have to use assessment methods
which are invariant to the scale of the distance.

Figure 3 Three-dimensional Molecular Distance Maps of genomic DNA
sequences from all six organisms in the dataset, obtained using
(a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan,
(e) Pearson and (f) approximated information distance,
respectively. Each point corresponds to a 150 kbp genomic sequence
from H. sapiens (blue), E. coli (green), S. cerevisiae (red),
A. thaliana (turquoise), P. falciparum (magenta), and
P. furiosus (orange).

The “ground-truth” that we use as a basis for our distance
assessment is the fact that the “ideal” clustering of DNA sequences
and the points that represent them is known: sequences from the
same organism should be close to one another and far from sequences
originating from other organisms. (This assumption is justified for
this dataset, as the six organisms considered are very different
from one another, belonging to different kingdoms of life.) Thus,
an optimal distance should yield a relatively small distance
between two FCGRs which were generated from DNA sequences
originating from the same organism, and relatively high distances
between two FCGRs originating from DNA sequences coming from
different organisms.

Figure 4 Histograms of pairwise intragenomic and intergenomic
distances among the DNA sequences from H. sapiens and A. thaliana.
Panels: (a) DSSIM distance, (b) descriptor distance, (c) Euclidean
distance, (d) Manhattan distance, (e) Pearson distance,
(f) approximated information distance. Legend: Homo sapiens;
Arabidopsis thaliana; Homo sapiens-Arabidopsis thaliana.

In order to assess each of the six distances quantitatively, we
computed three quality measures which rate different features of a
distance:

• the correlation to an idealized cluster distance

• the silhouette cluster accuracy

• the relative overlap between the intragenomic and
intergenomic distance histograms.

Let us stress that all three quality measures of the six distances
are based on the distance matrices which we computed and not on
their MDS plots. We will define the three quality measures such
that their expected values range in the interval [0, 1], where
higher values correspond to better performance.

Let us first describe the three quality measures informally. An
idealized distance is a distance that would be able to
differentiate DNA sequences by species, that is, a distance
$\delta$ for which $\delta(x, y) = 0$ if $x$ and $y$ are sequences
from the same species and $\delta(x, y) = 1$ otherwise. The first
quality measure, the correlation to an idealized cluster distance,
measures how well a distance is linearly correlated to the
idealized distance $\delta$. The second quality measure, silhouette
cluster accuracy, is the percentage of points that are best
embedded in the cluster they belong to. The third quality measure
quantifies the “visual overlap” between the intragenomic and
intergenomic distance histograms. Given our dataset, it is
reasonable to expect that a good distance gives a low value if
applied to FCGRs of genomic sequences of the same organism, and a
high value when applied to FCGRs of genomic sequences from two
different organisms, thus separating the histograms of intragenomic
distances from that of intergenomic distances. This is illustrated
by the histograms in Figure 4, where a high overlap between the
graphs of intragenomic distances (dark blue and turquoise) and the
graph of intergenomic distances (grey) is an indication of a poorly
performing distance. In a theoretically optimal situation, there
would exist a value $c$ such that all distances that are smaller
than $c$ are intragenomic distances and all distances that are
larger than $c$ are intergenomic distances. This can usually not be
expected from real data, but a low overlap between histograms is
nevertheless indicative of a “good” distance.

In order to formally define the three quality measures, we consider
a dataset $V$ which is partitioned into $p$ non-overlapping
clusters $C_1, \ldots, C_p$ and for which a distance
$d_\alpha : V \times V \to \mathbb{R}_{\ge 0}$ exists. The
cardinalities of the sets are $|V| = m$ and $|C_i| = m_i$ for
$i = 1, \ldots, p$. In our analysis, $p = 6$ and $C_1$ contains all
FCGRs generated from genomic DNA sequences from H. sapiens, $C_2$
contains all FCGRs generated from genomic sequences of E. coli, and
so on, according to the order in Table 1. The distance $d_\alpha$
is one of the six distances,
$\alpha \in \{\mathrm{DSSIM}, D, E, M, P, \mathrm{AID}\}$.

The correlation to an idealized cluster distance is computed as
follows. We define the idealized cluster distance as a function (or
matrix) $\delta : V \times V \to \{0, 1\}$ such that
$\delta(x, y) = 0$ if and only if $x$ and $y$ belong to the same
cluster, and $\delta(x, y) = 1$ otherwise. Because we can view
$d_\alpha$ and $\delta$ as discrete, symmetric functions which have
the same domain, we can compute their correlation coefficient. We
define the correlation of $\delta$ to $d_\alpha$ to be the Pearson
correlation of $\delta$ and $d_\alpha$. More precisely, the upper
triangular part of the matrix corresponding to a distance
$d_\alpha$ is interpreted as a vector $(x_1, \ldots, x_n)$ and
compared with the corresponding values $(y_1, \ldots, y_n)$ given
by $\delta$. We obtain the $\delta$-correlation as

\[ \mathcal{D}_\alpha = \frac{\sigma_{xy}}{\sigma_x \sigma_y}. \]

The correlation ranges in the interval $[-1, 1]$: a value of 1
means that $d_\alpha$ and $\delta$ are perfectly linearly
correlated, and a value of 0 means that they are linearly
unrelated. In other words, if the correlation of a given distance
to the idealized cluster distance is close to 1, the given distance
is close to the idealized cluster distance and hence performs well.
Note that negative values for this measure are not expected, as
this would imply that $d_\alpha$ and $\delta$ were negatively
related ($d_\alpha$ would perform worse than a matrix containing
random entries).
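A minimal sketch of this first quality measure, assuming a symmetric
distance matrix and one cluster label per sequence (the function name
is hypothetical, and this is not the code of [20]):

```python
import numpy as np

def idealized_correlation(D, labels):
    """Pearson correlation between a distance matrix and the idealized
    cluster distance delta (0 within a cluster, 1 across clusters),
    taken over the upper triangular part of both matrices."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    iu = np.triu_indices(D.shape[0], k=1)       # each unordered pair once
    x = D[iu]                                   # observed distances
    delta = (labels[:, None] != labels[None, :]).astype(float)
    y = delta[iu]                               # idealized distances
    return float(np.corrcoef(x, y)[0, 1])
```

A distance matrix that already equals $\delta$ yields a correlation of
1, and any positive rescaling of a distance leaves the value unchanged,
which is why the measure is invariant to the scale of the distance.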

The silhouette cluster accuracy is based on the silhouette
coefficient, defined in [24], a measure that determines how well a
single point is embedded in the cluster to which it belongs. For a
point $x$ from cluster $C_i$ we define $a_x$ as the average
distance of this point to all other points in $C_i$, that is,

\[ a_x = \frac{1}{m_i - 1} \sum_{y \in C_i,\, y \neq x} d_\alpha(x, y), \]

and we define $b_x$ as the minimum over the average distances of
$x$ to all points of a different cluster,

\[ b_x = \min_{j \neq i} \frac{1}{m_j} \sum_{y \in C_j} d_\alpha(x, y). \]

The silhouette coefficient of $x$ is defined as

\[ S_\alpha(x) = \frac{b_x - a_x}{\max\{a_x, b_x\}}. \]

If a point $x$ has a silhouette coefficient $S_\alpha(x) \le 0$,
then $x$ is at least as close to a cluster to which it does not
belong as to its own cluster. The silhouette cluster accuracy
$A_\alpha$ denotes the percentage of points with a silhouette
coefficient greater than 0, that is, the percentage of points which
are well-embedded in their own cluster,

\[ A_\alpha = \frac{|\{x \in V \mid S_\alpha(x) > 0\}|}{m}. \]

Obviously, the silhouette cluster accuracy ranges in [0, 1], with a
high accuracy being desirable.
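The silhouette cluster accuracy can be sketched directly from these
definitions. The function name is hypothetical, and the sketch assumes
that every cluster contains at least two points and that not all
distances are zero (otherwise $a_x$ or the quotient is undefined).

```python
import numpy as np

def silhouette_accuracy(D, labels):
    """Fraction of points whose silhouette coefficient is greater than 0."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    well_embedded = 0
    for idx in range(D.shape[0]):
        own = labels[idx]
        same = labels == own
        same[idx] = False                       # exclude the point itself
        a = D[idx, same].mean()                 # average distance inside own cluster
        # smallest average distance to any other cluster
        b = min(D[idx, labels == c].mean() for c in clusters if c != own)
        s = (b - a) / max(a, b)                 # silhouette coefficient of the point
        if s > 0:
            well_embedded += 1
    return well_embedded / D.shape[0]
```

A value of 1.0 means that every sequence is, on average, closer to the
sequences of its own genome than to those of any other genome.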

For assessing the relative overlap of the histograms, consider any
two clusters $C_i$ and $C_j$ with $i \neq j$ (for example, $C_1$ is
the H. sapiens cluster and $C_4$ the A. thaliana cluster). We
compare the two sets of intragenomic distances $C_i$-$C_i$ and
$C_j$-$C_j$ with the set of intergenomic distances $C_i$-$C_j$. For
a distance $d_\alpha$, we divide the range from $\min(d_\alpha)$ to
the maximum distance $\max(d_\alpha)$ in this dataset into 100 bins
of size $r = \frac{\max(d_\alpha) - \min(d_\alpha)}{100}$ and count
the distances which fall into each bin: $c_{i,i}[\ell]$ denotes bin
$\ell$ containing distances from $C_i$-$C_i$ and $c_{i,j}[\ell]$
denotes bin $\ell$ containing distances from $C_i$-$C_j$. For
$\ell = 1, \ldots, 100$ we let

\[ c_{i',j'}[\ell] = \bigl|\{\, \{x, y\} \mid x \in C_{i'},\ y \in C_{j'},\ x \neq y,
   \text{ and } (\ell - 1) \cdot r < d_\alpha(x, y) < \ell \cdot r \,\}\bigr|. \]

By $s_{i',j'}$ we denote the sum over all $c_{i',j'}$-bins,
$s_{i',j'} = \sum_{\ell=1}^{100} c_{i',j'}[\ell]$. We define the
relative overlap $O_\alpha(i, j)$ of $C_i$-$C_i$ (intragenomic
distances) with $C_i$-$C_j$ (intergenomic distances) as


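\[ O_\alpha(i, j) = \frac{1}{\min\{s_{i,i}, s_{i,j}\}}
   \sum_{\ell=1}^{100} \frac{c_{i,i}[\ell] \cdot c_{i,j}[\ell]}{\max\{c_{i,i}[\ell], c_{i,j}[\ell]\}} \]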


The relative overlap $O_\alpha(j, i)$ of $C_j$-$C_j$ with
$C_i$-$C_j$ is defined analogously; note that
$O_\alpha(i, j) \neq O_\alpha(j, i)$ in general. The overlap is
normalized to the range [0, 1], where 0 means no overlap of
elements of bins between intra- and intergenomic distances, and 1
means that one of the histograms completely “covers” the other.
Also note that we are not interested in the overlap of $C_i$-$C_i$
with $C_j$-$C_j$, as both sets of distances are intragenomic
distances.

Since we intend to define a quality measure where a value close to
1 should represent a small overlap, we will use
$1 - O_\alpha(i, j)$ as relative overlap. Furthermore, we combine
these quantities for all possible pairs of clusters $C_i$ and
$C_j$, obtaining the relative overlap as:

\[ \mathcal{O}_\alpha = \frac{1}{p(p-1)} \sum_{i=1}^{p} \sum_{j=1,\, j \neq i}^{p} \bigl(1 - O_\alpha(i, j)\bigr) \]

For example, in Figure 4, for each of the considered distances, the
dark blue histograms depict the $C_1$-$C_1$ (H. sapiens -
H. sapiens) intragenomic distances, the turquoise histograms the
$C_4$-$C_4$ (A. thaliana - A. thaliana) intragenomic distances, and
the grey histograms the $C_1$-$C_4$ (H. sapiens - A. thaliana)
intergenomic distances. As seen from this figure, the descriptor
distance appears to visually perform best at separating the two
intragenomic distance histograms from the intergenomic histogram,
while the Euclidean distance has the weakest performance. The
relative overlap attempts to quantify this by computing the
overlaps of each of the two pairs of histograms (dark blue with
grey and turquoise with grey). Note that small visual histogram
overlaps will result in a high numerical relative overlap, which is
indicative of a better performing distance.
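The third quality measure can be sketched as follows. Several details
here are assumptions rather than statements of the original method: the
per-pair overlap is computed as the sum over bins of the smaller of the
two bin counts, divided by the smaller of the two histogram totals; the
100 bin edges run from the smallest to the largest pairwise distance in
the dataset, with boundary values assigned as `numpy.histogram` does;
and the final score is the plain average of $1 - O_\alpha(i, j)$ over
all ordered pairs of distinct clusters. The function names are
hypothetical, and each cluster is assumed to hold at least two points.

```python
import numpy as np

def pair_distances(D, labels, ci, cj):
    """Distances between clusters ci and cj (each unordered pair once)."""
    rows = np.where(labels == ci)[0]
    cols = np.where(labels == cj)[0]
    block = D[np.ix_(rows, cols)]
    if ci == cj:
        return block[np.triu_indices(len(rows), k=1)]
    return block.ravel()

def relative_overlap(D, labels, bins=100):
    """Average of 1 - O(i, j) over all ordered pairs of distinct clusters."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    off_diag = D[np.triu_indices(D.shape[0], k=1)]
    edges = np.linspace(off_diag.min(), off_diag.max(), bins + 1)
    total, count = 0.0, 0
    for ci in clusters:
        intra, _ = np.histogram(pair_distances(D, labels, ci, ci), bins=edges)
        for cj in clusters:
            if ci == cj:
                continue
            inter, _ = np.histogram(pair_distances(D, labels, ci, cj), bins=edges)
            # shared mass of the two histograms, relative to the smaller one
            overlap = np.minimum(intra, inter).sum() / min(intra.sum(), inter.sum())
            total += 1.0 - overlap
            count += 1
    return total / count
```

With this reading, a score of 1.0 means that no bin contains both
intragenomic and intergenomic distances for any pair of clusters.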


Distance Comparison Results 

The results of comparing the six distances we analyzed, using the
three quality measures, are listed in Table 4. Recall that all
quality measures have an expected range of [0, 1], where larger
values imply better performance.



Distance       D_α     A_α     O_α     z-score sum   Rank
DSSIM          0.627   1.000   0.965    1.895        2nd
Descriptors    0.639   0.976   0.988    2.509        1st
Euclidean      0.231   0.325   0.907   -4.831        6th
Manhattan      0.527   1.000   0.951    0.84         3rd
Pearson        0.536   0.980   0.888   -0.875        5th
Approx. Inf.   0.527   1.000   0.937    0.462        4th

Table 4 Summary of quality measures for the performances of six
distances (DSSIM, descriptors, Euclidean, Manhattan, Pearson,
approximated information distance) on a dataset of 508 genomic
DNA sequences taken from organisms from each kingdom of life.
D_α is the correlation to an idealized cluster distance, A_α the
silhouette cluster accuracy, and O_α the relative overlap. Higher
is better.


To compare each distance relative to all the other distances, we
further compute, for each quality measure (each column), the
standard scores (z-scores) of each distance $d_\alpha$, where
$\alpha \in \{\mathrm{DSSIM}, D, E, M, P, \mathrm{AID}\}$, as
$z(d_\alpha) = \frac{d_\alpha - \mu}{\sigma}$, where $d_\alpha$
here stands for the value of the quality measure attained by that
distance, $\mu$ is the mean and $\sigma$ is the standard deviation
of all six values for that particular quality measure (column). A
positive value of the standard score means that a distance performs
above average (in this category) and a negative value that it
performs below average.

Finally, we compute, for each distance, the sum of its z-scores
over the three quality measures, as seen in Table 4. Note that the
total of z-scores for a distance represents the performance of that
distance relative to the other distances, and indicates its
relative ranking.
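The z-score sums and the ranking can be recomputed from the three
quality-measure columns of Table 4, as in the sketch below. The use of
the sample standard deviation (`ddof=1`) is an assumption, since the
text only speaks of the deviation; with this choice the sums appear to
agree with the z-score sum column of Table 4 to the printed precision.

```python
import numpy as np

names = ["DSSIM", "Descriptors", "Euclidean",
         "Manhattan", "Pearson", "Approx. Inf."]

# Columns: correlation, silhouette cluster accuracy, relative overlap
# (values copied from Table 4).
scores = np.array([
    [0.627, 1.000, 0.965],
    [0.639, 0.976, 0.988],
    [0.231, 0.325, 0.907],
    [0.527, 1.000, 0.951],
    [0.536, 0.980, 0.888],
    [0.527, 1.000, 0.937],
])

mu = scores.mean(axis=0)               # column means
sigma = scores.std(axis=0, ddof=1)     # column sample standard deviations (assumed)
z = (scores - mu) / sigma              # z-score of each distance per column
z_sum = z.sum(axis=1)                  # one total per distance

# Distances ordered from best (largest total) to worst.
ranking = [names[i] for i in np.argsort(-z_sum)]
```

Because z-scores are relative to the six distances being compared,
adding or removing a distance changes every total, so the sums are
only meaningful within one table.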

The conclusion of this analysis is that the best performing
distances are the descriptor distance and DSSIM. Manhattan,
Pearson, and approximated information distance perform well in some
categories but not so well in other categories. For this dataset
and value of k, the Euclidean distance had the weakest overall
performance, scoring lowest in two of the three measured categories
(correlation and silhouette cluster accuracy; see Table 4), which
confirms the visual assessment of the MDS plots obtained by using
the Euclidean distance, as seen in Figure 2 and Figure 3.

It is worth noting that the two distances which perform best (DSSIM
and descriptor) treat FCGR matrices as two-dimensional maps in
which the local arrangement of the cells (matrix entries)
influences the computed distance, whereas the other distances treat
the FCGR matrices as linear vectors. This suggests that the
organization of the k-mer tallies (in this paper k = 9) of a DNA
sequence as an FCGR matrix, rather than a simple vector, reveals
structural properties of the DNA sequence that could be utilized in
order to identify and classify genomic DNA sequences.

Discussion and Conclusions

In this study we test the hypothesis that CGR-based genomic
signatures of genomic DNA sequences are indeed species- and
genome-specific. With this goal in mind we analyze over five
hundred 150 kbp genomic DNA sequences originating from organisms
representing each of the kingdoms of life. Our quantitative
comparison of six different distances suggests that several other
distances outperform the Euclidean distance, which has been until
now almost exclusively used in such studies. Our preliminary
results show that two of these distances, DSSIM and descriptor
distance (introduced here), when applied to CGR-based genomic
signatures, have indeed the ability to differentiate between DNA
sequences coming from different species. This indicates that the
k-mer sequence composition (where k = 1, 2, ..., 9) of genomic
sequences contains taxonomic information which could potentially
aid in the identification, comparison and classification of species
based on molecular evidence. The two-dimensional and
three-dimensional Molecular Distance Maps we obtain, which
visualize the simultaneous intragenomic and intergenomic
interrelationships among the sequences in our dataset, show this
method’s potential.

Further analysis is needed to explore this method’s potential for
the analysis of closely related species. As a preliminary
experiment, we applied it to H. sapiens chromosome 21
(NC_000021.8), which yields 234 sequences, and P. troglodytes
chromosome Y (NC_006492.3), which yields 168 sequences, all
150 kbp long.



Distance       D_α     A_α     O_α     z-score sum   Rank
DSSIM          0.167   0.915   0.136    3.453        1st
Descriptors    0.015   0.500   0.101   -2.593        5th
Euclidean      0.037   0.58    0.069   -2.899        6th
Manhattan      0.112   0.863   0.108    1.27         3rd
Pearson        0.142   0.714   0.119    1.339        2nd
Approx. Inf.   0.075   0.933   0.062   -0.569        4th

Table 5 Summary of quality measures for the performances of six
distances (DSSIM, descriptors, Euclidean, Manhattan, Pearson,
approximated information distance) on a dataset of 402 DNA
sequences from H. sapiens, chromosome 21 and P. troglodytes,
chromosome Y. D_α is the correlation to an idealized cluster
distance, A_α is the silhouette cluster accuracy, and O_α is the
relative overlap.

The Molecular Distance Maps in Figure 5 and Figure 6, of 402 DNA
sequences, suggest that several of the distances are able to
differentiate even between DNA sequences from closely related
organisms. As seen in Table 5, the Euclidean distance was again
outperformed by other distances when assessed with the quality
measures we described. In this case study, we note a change in the
distance rankings: DSSIM, which ranked second previously, now ranks
first, while the descriptor distance, which ranked first
previously, now ranks second to last. This may be an indication
that the descriptor distance, which was designed to detect pattern
differences, may only perform well for analyses of sequences of
distantly related organisms, while DSSIM, which is sensitive to
small differences in similar images, may be the preferred option
for fine-grained analyses at the genus, family and species level.

Figure 5 Two-dimensional Molecular Distance Maps of 150 kbp genomic
DNA sequences from H. sapiens (blue) and P. troglodytes (red) using
the six distances. Panels: (a) DSSIM distance, (b) descriptor
distance, (c) Euclidean distance, (d) Manhattan distance,
(e) Pearson distance, (f) approximated information distance.

Further large-scale computational experiments have to be carried
out to confirm these preliminary results and establish their
validity. Such experiments could provide additional insights
regarding the choice of optimal distance for structural genome
comparison in different settings.

Figure 6 Three-dimensional Molecular Distance Maps of 150 kbp
genomic DNA sequences from H. sapiens (blue) and P. troglodytes
(red) using the six distances.